Combining amplitude and phase-based features for speaker verification with short duration utterances
نویسندگان
چکیده
Due to the increasing use of fusion in speaker recognition systems, one trend of current research activity focuses on new features that capture complementary information to the MFCC (Mel-frequency cepstral coefficients) for improving speaker recognition performance. The goal of this work is to combine (or fuse) amplitude and phase-based features to improve speaker verification performance. Based on the amplitude and phase spectra we investigate some possible variations to the extraction of cepstral coefficients that produce diversity with respect to fused subsystems. Among the amplitude-based features we consider widely used MFCC, Linear frequency cepstral coefficients, and multitaper spectrum estimationbased MFCC (denoted here as MMFCC). To compute phasebased features we choose modified group delayand all-pole group delay-, linear prediction residual phase-based features. We also consider product spectrum-based cepstral coefficients features that are influenced by both the amplitude and phase spectra. For performance evaluation, text-dependent speaker verification experiments are conducted on the a proprietary dataset known as Voice Trust-Pakistan (VT-Pakistan) corpus. Experimental results show that the fused system provide reduced error rate compared to both the amplitude and phasebased features. On the average fused system provided a relative improvement of 37% over the baseline MFCC systems in terms of EER, DCF (detection cost function) of SRE 2008 and DCF of SRE 2010.
منابع مشابه
Speaker Identification by Combining Various Vocal Tract and Vocal Source Features
Previously, we proposed a speaker recognition system using a combination of MFCC-based vocal tract feature and phase information which includes rich vocal source information. In this paper, we investigate the efficiency of combination of various vocal tract features (MFCC and LPCC) and vocal source features (phase and LPC residual) for normal-duration and short-duration utterance. The Japanese ...
متن کاملTransfer Learning for Speaker Verification on Short Utterances
Short utterance lacks enough discriminative information and its duration variation will propagate uncertainty into a probability linear discriminant analysis (PLDA) classifier. For speaker verification on short utterances, it can be considered as a domain with limited amount of long utterances. Therefore, transfer learning of PLDA can be adopted to learn discriminative information from other do...
متن کاملCombining source and system information for limited data speaker verification
Speaker verification using limited data is always a challenge for practical implementation as an application. An analysis on speaker verification studies for an i-vector based method using Mel-Frequency Cepstral Coefficient (MFCC) feature shows that the performance drops drastically as the duration of test data is reduced. This decrease in performance is due to insufficient phonetic coverage wh...
متن کاملAnalysing the performance of Speaker Verification task using different features pdfkeywords=Mel Frequency Cepstral Coefficient(MFCC), Linear Predictive Cepstral Coefficient(LPCC), Perceptual Linear Predictive(PLP), Equal Error Rate(EER)
Speaker recognition is the identification of the person who is speaking by characteristics of their voices, also called “voice recognition”. The components of Speaker Recognition includes Speaker Identification(SI) and Speaker Verification(SV). Speaker identification is the task of determining an unknown speakers identity. If the speaker claims to be of a certain identity and the voice is to ve...
متن کاملDeep Speaker Embeddings for Short-Duration Speaker Verification
The performance of a state-of-the-art speaker verification system is severely degraded when it is presented with trial recordings of short duration. In this work we propose to use deep neural networks to learn short-duration speaker embeddings. We focus on the 5s-5s condition, wherein both sides of a verification trial are 5 seconds long. In our previous work we established that learning a non-...
متن کامل